The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), āpatternedā precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35).
The gene expression count data was obtained from Kallisto (Bray et
al., 2016) using the Gencode hg38 release Homo sapiens transcriptome.
The quantification of alternative splicing was performed using VAST-TOOLS (Tapial et
al., 2017). The data required for this practical session can be
downloaded from Zenodo
(please note this is a different version of the one in the
previous module).
load("./AS_GE_data.RData")
#Data:
# info : data-frame of information for the nuclear samples
# myE_gen : noramlised gene expression count matrix of the CTRL nuclear fraction quantile-normalised
# dat_diff_time_nuc_ctrl : data-frame with different AS analysis for the nuclear compartment
# dat_diff_time_cyto_ctrl : data-frame with different AS analysis for the cytoplasmic compartment
# tab_psi : data frame with all events in VAST-TOOLS and their PSI values in each sample
# Percentage or proportion spliced-in (PSI or ĪØ)
#Here I make some nice colors to facilitate the visualisation
mygroup <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days) <- c(0,3,7,14,22,35)
mycols <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))
#Biological Pathway Gene Enrichment Analysis
GO_analysis <- function(genes_list){
gostres_diff <- gost(query = genes_list,
organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
measure_underrepresentation = FALSE, evcodes = TRUE,
user_threshold = 0.05, correction_method = "g_SCS",
domain_scope = "annotated", custom_bg = NULL,
numeric_ns = "", as_short_link = FALSE,sources=c("GO:BP", "GO:MF","KEGG"))
gostplot(gostres_diff, capped = FALSE, interactive = TRUE)#please note this is going to create an interactive plot
}
#Plot PSI events over time in nuclear cyto fractions
plotEventOverTime <- function(event="HsaINT0002504"){
toi <- c("E.dPsi.d_3_0","E.dPsi.d_7_0","E.dPsi.d_14_0","E.dPsi.d_22_0","E.dPsi.d_35_0")
dat_nuc <- dat_diff_time_nuc_ctrl[match(event,dat_diff_time_nuc_ctrl$EVENT),match(toi,colnames(dat_diff_time_nuc_ctrl))]
dat_cyto<- dat_diff_time_cyto_ctrl[match(event,dat_diff_time_cyto_ctrl$EVENT),match(toi,colnames(dat_diff_time_cyto_ctrl))]
tempdat <- rbind(as.numeric(dat_nuc),as.numeric(dat_cyto))
rownames(tempdat)<-c("nuc","cyto")
colnames(tempdat)<- c("3","7","14","22","35")
tempdat[is.na(tempdat)]<-0
barplot(tempdat,beside=TRUE,las=1,ylab="dPSI",xlab="DIV",main=paste(tab_psi$GENE[match(event,tab_psi$EVENT)], "-", tab_psi$COORD[match(event,tab_psi$EVENT)]),col=c("#CC9C3D","#476DB4"))
legend("topleft",ncol=1,pch=15,col=c("#CC9C3D","#476DB4"),leg=c("nuc","cyto"),cex=0.7)
grid()
}
#To following function sort myvec in decreasing order and return their indexes:
myidx<- sort(x=c(1:2000),index.return=TRUE,decreasing=TRUE)$ix
dat_diff_time_nuc_ctrl data-table) and cytoplasmic
fractions (dat_diff_time_cyto_ctrl). The schematic below
shows the four main types of events we are going to look at.
Here is a summary of the column content which you can find at on Github repo of VAST-TOOL:
First you always need to make sure you understand the format of your
data, the content of the columns and the rows. Start by checking the
dimensions of the data using dim, nrow,
ncol functions. Then if the data-count table is not too
large, you can use the View function to visualise and
explore its content. Alternatively you can print a couple
of rows. Here you need to understand the output from VAST-TOOLS.
dim(tab_psi)
## [1] 365711 102
print(colnames(tab_psi))
## [1] "GENE" "EVENT" "COORD" "LENGTH"
## [5] "FullCO" "COMPLEX" "0_CB1D_cyto" "0_CB1D_nuc"
## [9] "0_CB1E_cyto" "0_CB1E_nuc" "0_CTRL1_cyto" "0_CTRL1_nuc"
## [13] "0_CTRL3_cyto" "0_CTRL3_nuc" "0_CTRL4_cyto" "0_CTRL4_nuc"
## [17] "0_CTRL5_cyto" "0_CTRL5_nuc" "0_GliA_cyto" "0_GliA_nuc"
## [21] "0_GliB_cyto" "14_CB1D_cyto" "14_CB1D_nuc" "14_CB1E_cyto"
## [25] "14_CB1E_nuc" "14_CTRL1_cyto" "14_CTRL1_nuc" "14_CTRL3_cyto"
## [29] "14_CTRL3_nuc" "14_CTRL4_cyto" "14_CTRL4_nuc" "14_CTRL5_cyto"
## [33] "14_CTRL5_nuc" "14_GliA_cyto" "14_GliA_nuc" "14_GliB_cyto"
## [37] "14_GliB_nuc" "22_CB1D_cyto" "22_CB1D_nuc" "22_CB1E_cyto"
## [41] "22_CB1E_nuc" "22_CTRL1_cyto" "22_CTRL1_nuc" "22_CTRL3_cyto"
## [45] "22_CTRL3_nuc" "22_CTRL4_cyto" "22_CTRL4_nuc" "22_CTRL5_cyto"
## [49] "22_CTRL5_nuc" "22_GliA_cyto" "22_GliA_nuc" "22_GliB_cyto"
## [53] "22_GliB_nuc" "35_CB1D_cyto" "35_CB1D_nuc" "35_CB1E_cyto"
## [57] "35_CB1E_nuc" "35_CTRL1_cyto" "35_CTRL1_nuc" "35_CTRL3_cyto"
## [61] "35_CTRL3_nuc" "35_CTRL4_cyto" "35_CTRL4_nuc" "35_CTRL5_cyto"
## [65] "35_CTRL5_nuc" "35_GliA_cyto" "35_GliA_nuc" "35_GliB_cyto"
## [69] "35_GliB_nuc" "3_CB1D_cyto" "3_CB1D_nuc" "3_CB1E_cyto"
## [73] "3_CB1E_nuc" "3_CTRL1_cyto" "3_CTRL1_nuc" "3_CTRL3_cyto"
## [77] "3_CTRL3_nuc" "3_CTRL4_cyto" "3_CTRL4_nuc" "3_CTRL5_cyto"
## [81] "3_CTRL5_nuc" "3_GliA_cyto" "3_GliA_nuc" "3_GliB_cyto"
## [85] "3_GliB_nuc" "7_CB1D_cyto" "7_CB1D_nuc" "7_CB1E_cyto"
## [89] "7_CB1E_nuc" "7_CTRL1_cyto" "7_CTRL1_nuc" "7_CTRL3_cyto"
## [93] "7_CTRL3_nuc" "7_CTRL4_cyto" "7_CTRL4_nuc" "7_CTRL5_cyto"
## [97] "7_CTRL5_nuc" "7_GliA_cyto" "7_GliA_nuc" "7_GliB_cyto"
## [101] "7_GliB_nuc" "COMPLEX_short"
dim(dat_diff_time_nuc_ctrl)
## [1] 365711 21
print(colnames(dat_diff_time_cyto_ctrl))
## [1] "GENE" "EVENT" "COORD" "LENGTH"
## [5] "FullCO" "COMPLEX" "E.dPsi.d_3_0" "MV.dPsi.d_3_0"
## [9] "E.dPsi.d_7_0" "MV.dPsi.d_7_0" "E.dPsi.d_7_3" "MV.dPsi.d_7_3"
## [13] "E.dPsi.d_14_0" "MV.dPsi.d_14_0" "E.dPsi.d_22_0" "MV.dPsi.d_22_0"
## [17] "E.dPsi.d_35_0" "MV.dPsi.d_35_0" "E.dPsi.d_35_22" "MV.dPsi.d_35_22"
## [21] "COMPLEX_short"
dim(dat_diff_time_cyto_ctrl)
## [1] 365711 21
print(colnames(dat_diff_time_nuc_ctrl))
## [1] "GENE" "EVENT" "COORD" "LENGTH"
## [5] "FullCO" "COMPLEX" "E.dPsi.d_3_0" "MV.dPsi.d_3_0"
## [9] "E.dPsi.d_7_0" "MV.dPsi.d_7_0" "E.dPsi.d_7_3" "MV.dPsi.d_7_3"
## [13] "E.dPsi.d_14_0" "MV.dPsi.d_14_0" "E.dPsi.d_22_0" "MV.dPsi.d_22_0"
## [17] "E.dPsi.d_35_0" "MV.dPsi.d_35_0" "E.dPsi.d_35_22" "MV.dPsi.d_35_22"
## [21] "COMPLEX_short"
It is generally good to get an understanding of the type of events which are differentially spliced between two conditions. Pie-charts can be useful to visualise the relative abundances of events in each categories between different types of conditions. \(\Delta\)PSI>0 indicates inclusion while \(\Delta\)PSI<0 indicates skipping between two conditions. The default threshold of a difference in PSI is 15%.
Letās start by looking at the relative number of events with a threshold difference of 20% between the iPSC stage (DIV=0) and the MN stage (DIV=35) in the nucleus:
col_events=c("#CD4599","#8C8C8C","#6ABD45","#4B5493","#FEC20F")
names(col_events) <- names(table(dat_diff_time_nuc_ctrl$COMPLEX_short))
par(mfrow=c(1,3),mar=c(1,1,2,1))
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short),col=col_events)
mtext(side=3,line=-1,text="All annotated events",cex=0.7)
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short[dat_diff_time_nuc_ctrl$E.dPsi.d_35_0>0.2]),col=col_events)
mtext(side=3,line=0,text="NUCLEUS -- DIV=35",cex=0.7)
mtext(side=3,line=-1,text="dPSI>20%",cex=0.7)
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short[dat_diff_time_nuc_ctrl$E.dPsi.d_35_0<(-0.2)]),col=col_events)
mtext(side=3,line=-1,text="dPSI<(-20%)",cex=0.7)
And compare these with event distribution in cytoplasmic fractions:
par(mfrow=c(1,3),mar=c(1,1,2,1))
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short),col=col_events)
mtext(side=3,line=-1,text="All annotated events",cex=0.7)
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short[dat_diff_time_cyto_ctrl$E.dPsi.d_35_0>0.2]),col=col_events)
mtext(side=3,line=0,text="CYTOPLASM --DIV=35",cex=0.7)
mtext(side=3,line=-1,text="dPSI>20%",cex=0.7)
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short[dat_diff_time_cyto_ctrl$E.dPsi.d_35_0<(-0.2)]),col=col_events)
mtext(side=3,line=-1,text="dPSI<(-20%)",cex=0.7)
What are the key observations?
Having found some differences at MN stages in relative distributions of included events between cyto and nuc (enrichment in nuclear fraction of IR), letās now test whether splicing patterns are similar in nucleus versus cytoplasm over time.
pcc_events<- do.call(what=cbind,args=lapply(unique(dat_diff_time_cyto_ctrl$COMPLEX_short),function(EV)return(unlist(lapply(c("E.dPsi.d_3_0", "E.dPsi.d_7_0" , "E.dPsi.d_14_0" , "E.dPsi.d_22_0" , "E.dPsi.d_35_0"),function(coln)return(cor(dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$COMPLEX_short==EV,match(coln,colnames(dat_diff_time_nuc_ctrl))],dat_diff_time_cyto_ctrl[dat_diff_time_cyto_ctrl$COMPLEX_short==EV,match(coln,colnames(dat_diff_time_cyto_ctrl))],use = "complete")))))))
colnames(pcc_events)<- unique(dat_diff_time_cyto_ctrl$COMPLEX_short)
rownames(pcc_events)<-c("E.dPsi.d_3_0", "E.dPsi.d_7_0" , "E.dPsi.d_14_0" , "E.dPsi.d_22_0" , "E.dPsi.d_35_0")
par(mfrow=c(1,1))
barplot(pcc_events,beside=TRUE,col=unlist(lapply(colnames(pcc_events),function(Z)return(rep(col_events[match(Z,names(col_events))],5)))),las=1,ylim=c(0,1),ylab="Pearson Correlation Coefficient")
Letās now visualise these results, in particular the changes between NPC and MN stages in terms of alternative exon usage and intron retention:
par(mfrow=c(2,2),mar=c(4,4,4,4))
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Intron Retention",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=14]",cex=0.7)
grid()
abline(0,1,col="red",lty=2)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],use = "complete"),digit=2)),cex=0.7)
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Alternative Exon",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=14]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Intron Retention",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=35]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Alternative Exon",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=35]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)
If interested have a look at the role of nuclear detention of intron-retaining transcripts.
While hierarchical clustering might be dominated by some changes in few genes, other methods such as singular value decomposition (SVD) analysis can be instrumental in identifying independent biological pathways underlying changes in gene expression. It is also very appropriate for time-series analysis when variation is not linearly dependent on time.
#This function extract the fraction of variance per component as derived from SVD analysis
#Input= SVD object
getFractionVariance<- function(mySVD){
return(mySVD$d*mySVD$d/sum(mySVD$d*mySVD$d))
}
#This function calculate the Shannon Entropy which indicates how well the information is spread among the principal components.
#High value indicates information evenly distributed among components
#Low value indicates most information/variance is captured by a single component
#Takes as input the fraction of variance outputted from getFractionVariance
getShannonEntropy <- function(pip)return(-1*sum(pip*log10(pip))/log10(length(pip)))
#This function plots the fraction of variance captured by first N components.
#pi=fraction of variance as obtained from getFractionVariance
#dp=Shannon Entropy as obtained from getShannonEntropy
#N=number of components to plot
ScrePlot <- function(pi,dp,N=NA){
if(is.na(N)){
barplot(pi,las=1,cex.main=0.7,cex.axis=1.0,col="black")
}
else{
barplot(pi[c(1:N)],las=1,cex.main=0.7,cex.axis=1.0,col="black")
}
mtext(side = 1, line = 2, text ="principal components", col = "black",cex=0.7, font=1)
mtext(side = 2, line = 2, text ="Fraction of explained variance", col = "black",cex=0.7, font=1)
mtext(side = 3, line = -2, text = paste("Shannon Entropy = ",round(dp,digits=3)), col = "black",cex=0.7, font=1)
}
For this analysis we will focus on the nuclear samples and use the different types of data collected so far i.e.Ā gene expression, PUD, IR and AltEx. A similar analysis can be done on the cytoplasmic fraction.
ge <- myE_gen
EX_nuc <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="EX"),match(colnames(myE_gen),colnames(tab_psi))])
IR_nuc <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="IR"),match(colnames(myE_gen),colnames(tab_psi))])
rownames(EX_nuc) <- as.character(tab_psi$EVENT)[which(tab_psi$COMPLEX_short=="EX")]
rownames(IR_nuc) <- as.character(tab_psi$EVENT)[which(tab_psi$COMPLEX_short=="IR")]
#You need to remove NA's and NaN for the unsupervised analysis
tosel_ex <- which(apply(is.na(EX_nuc),1,sum)==0&apply(is.nan(EX_nuc),1,sum)==0)
tosel_ir <- which(apply(is.na(IR_nuc),1,sum)==0&apply(is.nan(IR_nuc),1,sum)==0)
EX_nuc <- EX_nuc[tosel_ex,]
IR_nuc <- IR_nuc[tosel_ir,]
It is common in SVD analysis to remove the first component which captures noise == most variations from from systematic changes in basal gene expression between genes.
#SVD on the Gene Expression
dat1 <- ge
SVD_eset <- svd(dat1)
pi_1 <- getFractionVariance(mySVD=SVD_eset)
dp_1 <- getShannonEntropy(pi_1)
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_GE <- svd(datm)
pi_2 <- getFractionVariance(mySVD=SVD_eset_GE)
dp_2 <- getShannonEntropy(pi_2)
par(mfrow=c(2,4))
ScrePlot(pi_1,dp_1,N=NA)
mtext(side=3,line=0,text="prior removing first component")
#Effect on the sample distribution
multidensity(dat1[,c(1,10,15,20)],main="prior filtering",las=1)
boxplot(dat1,outline=FALSE)
boxplot(t(dat1[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")
ScrePlot(pi_2,dp_2,N=NA)
mtext(side=3,line=0,text="after removing first component")
multidensity(datm[,c(1,10,15,20)],main="after filtering",las=1)
boxplot(datm,outline=FALSE)
#Effect on the genes distribution
boxplot(t(datm[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")
We start by performing SVD on individual matrices and compare how information is captured in the different matrices.
layout(matrix(c(1:6),ncol=3,nrow=2,byrow=FALSE))
#1. SVD on the Gene Expression
dat1 <- ge
SVD_eset <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="Gene Expression")
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_GE <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_GE),getShannonEntropy(getFractionVariance(SVD_eset_GE)))
mtext(side=3,line=0,text="Gene Expression")
#2. SVD on the AltEx
dat1 <- EX_nuc
SVD_eset <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="AltEx")
mtext(side=3,line=1,text="Prior removing first component")
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_Ex <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_Ex),getShannonEntropy(getFractionVariance(SVD_eset_Ex)))
mtext(side=3,line=1,text="After removing first component")
mtext(side=3,line=0,text="AltEx")
#3. SVD on the IR
dat1 <- IR_nuc
SVD_eset <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="IR")
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_IR <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_IR),getShannonEntropy(getFractionVariance(SVD_eset_IR)))
mtext(side=3,line=0,text="IR")
Letās now compare their first PCs to test who the time is captured by the different matrices in the different SVD.
#PCA plots
par(mfrow=c(3,3),mar=c(4,4,2,2))
plot(SVD_eset_GE$v[,1], SVD_eset_GE$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,1], SVD_eset_Ex$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,1], SVD_eset_IR$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="IR",pch=19,cex=1.5,las=1)
plot(SVD_eset_GE$v[,1], SVD_eset_GE$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,1], SVD_eset_Ex$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,1], SVD_eset_IR$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="IR",pch=19,cex=1.5,las=1)
plot(SVD_eset_GE$v[,2], SVD_eset_GE$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,2], SVD_eset_Ex$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,2], SVD_eset_IR$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="IR",pch=19,cex=1.5,las=1)
Do you see any differences in the clustering of the samples using the different count matrices?
Letās now look into the dynamic over time of the different component:
myMean <- list( t(apply(SVD_eset_GE$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))),
t(apply(SVD_eset_Ex$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))),
t(apply(SVD_eset_IR$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))) )
mySD <- list(t(apply(SVD_eset_GE$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))),
t(apply(SVD_eset_Ex$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))),
t(apply(SVD_eset_IR$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))) )
#Plot component over time
days <- c(0,3,7,14,21,35)
cats <- c("Gene Expression","AltEx","IR","3UTR")
CEX<- 0.7
par(mfrow=c(3,3),mar=c(5,5,2,0),oma=c(2,2,2,2))
for(j in c(1:3)){
for(i in c(1:3)){
MIN=min(0,min(myMean[[j]][i,]-mySD[[j]][i,]))
MAX=max(myMean[[j]][i,]+mySD[[j]][i,])
plot(days,myMean[[j]][i,],pch=19,type="b",lty=2,ylim=c(MIN,MAX),las=1,frame="F",xlab="time [days]",cex=1.0,cex.axis=CEX,cex.lab=CEX,ylab="")
mtext(side=2,line=3,text=paste("PC",i,sep=""),cex=CEX)
mtext(side=3,line=0,text=cats[j],cex=CEX)
grid()
abline(h=0,col="red",lty=2)
}
}
Your homework will be graded according to three criteria: 1) Correctness of your result; 2) Clarity of the visual output; 3) Description of your results demonstrating your ability to discuss your result in their biological context.
# We use the Deseq2 method for all samples
CEX <- 1
hc_ge <- hclust(dist(t(ge),method="man"), method = "ward.D", members = NULL)
hc_EX_nuc <- hclust(dist(t(EX_nuc),method="man"), method = "ward.D", members = NULL)
hc_IR_nuc <- hclust(dist(t(IR_nuc),method="man"), method = "ward.D", members = NULL)
par(mfrow=c(1,3))
par(xpd=NA,oma=c(0,0,2,0))
plot(ape::as.phylo(hc_ge),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="Gene expression")
plot(ape::as.phylo(hc_EX_nuc),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="AltEx")
plot(ape::as.phylo(hc_IR_nuc),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="IR")
In all these different hierarchical clustering it is possible to see that the samples from the same DIV are very similar and they are the first ones to be merged.
Considering gene expression plot, the most similar are DIV 3 and 7. The second most similar are DIV 22 and 35. Then DIV 0 is merged with DIV {3, 7} and then DIV 14 is merged with DIV {0, 3, 7}. The last two clusters to be merged are DIV {0, 3, 7, 14} and DIV {22, 35} and these two are very far compared to all the other merges.
Considering alternative exon, we still have that the last two clusters to be merged are DIV {0, 3, 7, 14} and DIV {22, 35}, which are quite far apart. What changes is that DIV 14 is merged with cluster {0, 3} before 0 (while in gene expression DIV 0 was merged with {0, 3} before DIV 14).
Considering intro retention, we still have that the last two clusters to be merged are DIV {0, 3, 7, 14} and DIV {22, 35}, which are quite far apart. In this case, DIV 7 and 14 are marged at first, then DIV 3 and at the end DIV 0.
# Use left singular matrix of the SVD for this task
SVD_eset <- svd(IR_nuc)
# Visualise the impact of the different PC components on the fraction of explained variance
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="IR before noise removal")
The main source of variance is given by the noise, so we remove the first component.
# Remove the first component
datm <- IR_nuc-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
# Visualise the impact of the different PC components on the fraction of explained variance
SVD_eset_after <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_after),getShannonEntropy(getFractionVariance(SVD_eset_after)))
mtext(side=3,line=0,text="IR after noise removal")
# For the first 5 PC components, we need to sort the vectors to obtain the 500 most positively and negatively contributing genes. The PC components in SVD_eset are already sorted.
indeces <- c(1,2,3,4,5)
colors <- rep(c("blue", "red", "green", "orange", "purple", "yellow"), each = 4)
times <- rep(c("DIV0", "DIV3", "DIV7", "DIV14", "DIV22", "DIV35"), each = 4)
first_500 <- lapply(indeces, function(index) {
# Extract the scores
PC <- SVD_eset_after$u[,index]
# Get the ordered indices
sorted_indices <- sort(x=PC, index.return=TRUE, decreasing=TRUE)$ix
# Get the indices of the top positive and top negative
top_positively_indices <- sorted_indices[1:500]
top_negatively_indices <- tail(sorted_indices, 500)
# Get the rows from IR_nuc corresponding to the top positive and top negative genes
most_positive_nuc <- IR_nuc[top_positively_indices,]
most_negative_nuc <- IR_nuc[top_negatively_indices,]
par(mfrow=c(2, 1))
# Plot the most positive and most negative genes
boxplot(most_positive_nuc, main = paste0("Top 500 most positively contributing genes, PC component: ", index), ylab = "PSI", xlab = "Time points", col = colors, las = 2, names=times)
boxplot(most_negative_nuc, main = paste0("Top 500 most negatively contributing genes, PC component: ", index), ylab = "PSI", xlab = "Time points", col = colors, las = 2, names=times)
most_positive_nuc_events <- rownames(most_positive_nuc)
most_negative_nuc_events <- rownames(most_negative_nuc)
# Get the name of the genes, for the GO analysis
most_positive_nuc_genes <- dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% most_positive_nuc_events, ]$GENE
most_negative_nuc_genes <- dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% most_negative_nuc_events, ]$GENE
list(positive = most_positive_nuc_genes, negative = most_negative_nuc_genes)
})
Comment: The first component is the one that primarily considers time. As time progresses, the mean value increases for the negatively contributing genes, and it decreases for positive contributing genes.
Perform BP enrichment analysis on relevant PCs as identified in previous point.
GO_analysis(first_500[[1]]$positive)
The 500 genes most positively contributing to the principal component 1 are related to transport and localization.
GO_analysis(first_500[[1]]$negative)
The 500 genes most negatively contributing to the principal component 1 are related to organization processes.
GO_analysis(first_500[[2]]$positive)
The 500 genes most positively contributing to the principal component 2 are related to organelle organization, transport and localization.
GO_analysis(first_500[[2]]$negative)
The 500 genes most negatively contributing to the principal component 2 are related to organelle organization, metabolic processes and regularization processes.
GO_analysis(first_500[[3]]$positive)
The 500 genes most positively contributing to the principal component 3 are related to developmental processes.
GO_analysis(first_500[[3]]$negative)
The 500 genes most negatively contributing to the principal component 3 are related to organelle organization, transport, vesicle-mediated transport, actin cytoskeleton organization, actin filament-based process.
GO_analysis(first_500[[4]]$positive)
The 500 genes most positively contributing to the principal component 4 are related to regulation processes, localization and neurotransmitter receptor transport.
GO_analysis(first_500[[4]]$negative)
The 500 genes most negatively contributing to the principal component 4 are related to developmental processes (nervous system development, system development, multicellular organism development) and cell morphogenesis.
GO_analysis(first_500[[5]]$positive)
The 500 genes most positively contributing to the principal component 5 are related to cellular response to stress, regulation processes and developmental processes
GO_analysis(first_500[[5]]$negative)
The 500 genes most negatively contributing to the principal component 5 are related to regulation processes, modification processes, organic cyclic compound biosynthetic process.